1 Big Picture

1.1 What is The Business Problem?

  • Identifying and targeting high-value customer segments to maximize customer lifetime value.

  • Developing personalized marketing strategies to increase customer engagement and loyalty.

  • Optimizing marketing spend and resource allocation based on customer segment characteristics.

  • Improving customer retention and reducing churn rates through targeted retention efforts.

  • Enhancing overall business profitability by focusing on the most valuable customer segments.

1.2 Motivation?

  • In the highly competitive e-commerce landscape, a one-size-fits-all approach is no longer effective for customer engagement and retention.

  • Understanding individual customer behavior and preferences is crucial for developing targeted marketing strategies that resonate with each segment.

  • Personalized communications, product recommendations, and promotions can significantly improve customer satisfaction, loyalty, and lifetime value.

  • The shift in consumer behavior post-COVID has accelerated the need for businesses to adopt data-driven, customer-centric strategies to remain competitive.

  • Effective customer segmentation and targeted marketing efforts can lead to higher conversion rates, increased average order value, and improved overall business performance.

2 Defining the Specific Business Challenge

The specific business challenge is to leverage customer transaction data to identify and characterize distinct customer segments based on their purchasing behavior, and to develop targeted marketing strategies that maximize customer engagement, retention, and profitability.

2.1 Key Objectives

  1. Analyze customer transaction data to identify meaningful customer segments using the RFM (Recency, Frequency, Monetary) framework.

  2. Develop a deep understanding of each segment’s characteristics, preferences, and purchasing patterns.

  3. Create personalized marketing strategies tailored to each segment, focusing on cross-selling, upselling, and retention efforts.

  4. Optimize marketing spend and resource allocation based on the value and potential of each customer segment.

  5. Monitor and assess the effectiveness of targeted marketing campaigns, making data-driven decisions to continually improve customer engagement and business performance.

By addressing this specific business challenge, the e-commerce company aims to enhance customer loyalty, increase customer lifetime value, and ultimately drive business growth and profitability in the competitive online marketplace.

3 Exploring the Dataset

3.1 Dataset Description

  • Source

  • Description

    • Marketing Insights for E-Commerce Company.
    • Transaction data has been provided for the period of 1/1/2019 to 12/31/2019.
  • Original

    • Online_Sales.csv

    • CustomersData.csv

    • Marketing_Spend.csv

    • Discount_Coupon.csv

    • Tax_amount.csv

Transformed Into…

  • eCommerce.csv

    • N = 52,924

    • 15 Variables

    • No Missing Values

3.2 Key Variables in the Dataset

  • CustomerID

    • Unique identifier for each customer
  • Transaction_ID

    • Unique identifier for each transaction
  • Transaction_Date

    • Date of the transaction
  • Product_SKU

    • Stock Keeping Unit for product identification
  • Product_Description

    • Description of the product
  • Product_Category

    • Category of the product
  • Quantity

    • Number of items purchased in the transaction
  • Avg_Price

    • Average price of the items
  • Delivery_Charges

    • Additional charges for product delivery
  • Coupon_Status

    • Indicates if a coupon was used
  • Month

    • Transaction month
  • Coupon_Code

    • Code of the coupon used
  • Discount_pct

    • Discount percentage applied
  • GST

    • Goods and Services Tax applied
  • Invoice

    • Total invoice amount after discounts and taxes

4 Levels of the Dataset

4.1 head(df) & tail(df)

4.1.1 head(df)

4.1.2 tail(df)

4.2 describe(df), str(df), & summary(df)

4.2.1 describe(df)

##                      vars     n          mean            sd     median
## CustomerID              1 52924 15346.7098103 1766.55602034 15311.0000
## Transaction_ID          2 52924 32409.8256745 8648.66897695 32625.5000
## Transaction_Date*       3 52924   186.8028872  101.72835957   194.0000
## Product_SKU*            4 52924   762.2928728  308.14316731   917.0000
## Product_Description*    5 52924   229.2883191  105.15812454   259.0000
## Product_Category*       6 52924    10.7620550    6.70540005    13.0000
## Quantity                7 52924     4.4976381   20.10471082     1.0000
## Avg_Price               8 52924    52.2376464   64.00688160    16.9900
## Delivery_Charges        9 52924    10.5176304   19.47561323     6.0000
## Coupon_Status*         10 52924     1.8295291    0.90444395     1.0000
## Month*                 11 52924     6.3685662    3.47001687     6.0000
## Coupon_Code*           12 52924    28.4509674   13.37335915    33.0000
## Discount_pct           13 52924    19.8023581    8.27887768    20.0000
## GST                    14 52924     0.1374618    0.04582478     0.1800
## Invoice                15 52924   101.9831975  172.36572872    45.6362
##                            trimmed         mad        min       max     range
## CustomerID           15360.9325224  2326.19940 12346.0000 18283.000  5937.000
## Transaction_ID       32385.1758621 10188.42720 16679.0000 48497.000 31818.000
## Transaction_Date*      187.6798299   124.53840     1.0000   365.000   364.000
## Product_SKU*           803.1309872   100.81680     1.0000  1145.000  1144.000
## Product_Description*   236.9775154    99.33420     1.0000   404.000   403.000
## Product_Category*       10.7015588     8.89560     1.0000    20.000    19.000
## Quantity                 1.5277279     0.00000     1.0000   900.000   899.000
## Avg_Price               42.3079039    21.03809     0.3900   355.740   355.350
## Delivery_Charges         6.9459700     0.00000     0.0000   521.360   521.360
## Coupon_Status*           1.7869154     0.00000     1.0000     3.000     2.000
## Month*                   6.3324516     4.44780     1.0000    12.000    11.000
## Coupon_Code*            28.9720123    14.82600     1.0000    46.000    45.000
## Discount_pct            19.8474256    14.82600     0.0000    30.000    30.000
## GST                      0.1417222     0.00000     0.0500     0.180     0.130
## Invoice                 74.1157519    50.78706     4.6035  8979.275  8974.672
##                              skew    kurtosis           se
## CustomerID           -0.032637368  -1.2343684  7.678936633
## Transaction_ID        0.005581206  -1.0254602 37.594381536
## Transaction_Date*    -0.066141763  -1.0976880  0.442196918
## Product_SKU*         -1.001510104  -0.5107442  1.339449091
## Product_Description* -0.513630742  -0.9404206  0.457105558
## Product_Category*    -0.082542602  -1.7885356  0.029147302
## Quantity             19.033722579 525.3827622  0.087391964
## Avg_Price             1.632487265   3.3417329  0.278227683
## Delivery_Charges     11.959061423 204.6103981  0.084657378
## Coupon_Status*        0.341717319  -1.6909090  0.003931473
## Month*                0.041208978  -1.2337928  0.015083609
## Coupon_Code*         -0.215521921  -1.6877942  0.058131854
## Discount_pct         -0.043117471  -1.3546626  0.035986958
## GST                  -0.351426450  -1.3971437  0.000199193
## Invoice              16.275019561 570.8286389  0.749246270

4.2.2 str(df)

## 'data.frame':    52924 obs. of  15 variables:
##  $ CustomerID         : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Transaction_ID     : int  16679 16680 16681 16682 16682 16682 16682 16682 16682 16682 ...
##  $ Transaction_Date   : chr  "2019-01-01" "2019-01-01" "2019-01-01" "2019-01-01" ...
##  $ Product_SKU        : chr  "GGOENEBJ079499" "GGOENEBJ079499" "GGOEGFKQ020399" "GGOEGAAB010516" ...
##  $ Product_Description: chr  "Nest Learning Thermostat 3rd Gen-USA - Stainless Steel" "Nest Learning Thermostat 3rd Gen-USA - Stainless Steel" "Google Laptop and Cell Phone Stickers" "Google Men's 100% Cotton Short Sleeve Hero Tee Black" ...
##  $ Product_Category   : chr  "Nest-USA" "Nest-USA" "Office" "Apparel" ...
##  $ Quantity           : int  1 1 1 5 1 15 15 15 5 52 ...
##  $ Avg_Price          : num  153.71 153.71 2.05 17.53 16.5 ...
##  $ Delivery_Charges   : num  6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 ...
##  $ Coupon_Status      : chr  "Used" "Used" "Used" "Not Used" ...
##  $ Month              : chr  "Jan" "Jan" "Jan" "Jan" ...
##  $ Coupon_Code        : chr  "ELEC10" "ELEC10" "OFF10" "SALE10" ...
##  $ Discount_pct       : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ GST                : num  0.1 0.1 0.1 0.18 0.18 0.18 0.18 0.18 0.18 0.1 ...
##  $ Invoice            : num  158.67 158.67 8.53 109.93 24.02 ...

4.2.3 summary(df)

##    CustomerID    Transaction_ID  Transaction_Date   Product_SKU       
##  Min.   :12346   Min.   :16679   Length:52924       Length:52924      
##  1st Qu.:13869   1st Qu.:25384   Class :character   Class :character  
##  Median :15311   Median :32626   Mode  :character   Mode  :character  
##  Mean   :15347   Mean   :32410                                        
##  3rd Qu.:16996   3rd Qu.:39126                                        
##  Max.   :18283   Max.   :48497                                        
##  Product_Description Product_Category      Quantity         Avg_Price     
##  Length:52924        Length:52924       Min.   :  1.000   Min.   :  0.39  
##  Class :character    Class :character   1st Qu.:  1.000   1st Qu.:  5.70  
##  Mode  :character    Mode  :character   Median :  1.000   Median : 16.99  
##                                         Mean   :  4.498   Mean   : 52.24  
##                                         3rd Qu.:  2.000   3rd Qu.:102.13  
##                                         Max.   :900.000   Max.   :355.74  
##  Delivery_Charges Coupon_Status         Month           Coupon_Code       
##  Min.   :  0.00   Length:52924       Length:52924       Length:52924      
##  1st Qu.:  6.00   Class :character   Class :character   Class :character  
##  Median :  6.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10.52                                                           
##  3rd Qu.:  6.50                                                           
##  Max.   :521.36                                                           
##   Discount_pct       GST            Invoice        
##  Min.   : 0.0   Min.   :0.0500   Min.   :   4.604  
##  1st Qu.:10.0   1st Qu.:0.1000   1st Qu.:  20.160  
##  Median :20.0   Median :0.1800   Median :  45.636  
##  Mean   :19.8   Mean   :0.1375   Mean   : 101.983  
##  3rd Qu.:30.0   3rd Qu.:0.1800   3rd Qu.: 137.400  
##  Max.   :30.0   Max.   :0.1800   Max.   :8979.275

5 Correlations & Interesting Relationships

5.1 Correlations

5.1.1 Plots

5.1.1.1 Correlation Matrix

5.1.1.2 Quantity vs. Invoice Scatterplot

5.1.2 Interesting Relationships

  • As the Avg_Price or Invoice amount increases, the GST (Goods and Services Tax) as a percentage of the total purchase falls.

  • With an increase of Quantity comes an increase in total Invoice amount and Delivery_Charges.

  • The larger the Invoice amount, the larger the Delivery_Charges.

  • The slope of the linear regression line between Quantity and Invoice value is 4.37.

6 Analysis

6.1 Methodology

# RFM component: Monetary
# Calculate the total transaction amount for each line item in the dataset.
# The formula includes the product of quantity and average price, adds the
# Goods and Services Tax (GST), and also adds any delivery charges associated
# with the transaction.
df$amount <-
  df$Quantity * df$Avg_Price * (1 + df$GST) + df$Delivery_Charges

# Aggregate the total monetary value by each customer. This uses the 'aggregate' function to sum up all the
# amounts spent per customer, identified by 'CustomerID'.
# The result is stored in 'rfm.m', which holds the
# total monetary value for each customer.
rfm.m <- aggregate(amount ~ CustomerID, data = df, sum)

# RFM component: Recency
# Identify the latest date in the dataset.
# This is done by accessing the last row's date from the
# 'Transaction_Date' column, assuming the data is ordered by date.
latestdate <-
  df$Transaction_Date[nrow(df)]  # last date in the dataset

# Calculate the number of days since the latest transaction
# for each transaction in the dataset.
# 'difftime' computes the difference in days between the 'latestdate' and each transaction's date.
df$days_since_latest <-
  as.integer(difftime(latestdate, df$Transaction_Date, units = "days"))

# Aggregate the minimum days since last transaction by each custome
# to find out the most recent purchase.
# This uses 'aggregate' to find the minimum 'days_since_latest' for each 'CustomerID', storing the result in 'rfm.r'.
rfm.r <- aggregate(days_since_latest ~ CustomerID, data = df, min)

# RFM component: Frequency
# Calculate the frequency of transactions for each customer. The frequency is determined by counting the
# unique transaction IDs per customer. This involves using the
# 'aggregate' function with a custom function
# that calculates the length of unique transaction IDs.
rfm.f <- aggregate(
  Transaction_ID ~ CustomerID,
  data = df,
  FUN = function(x)
    length(unique(x))
)

# Combine RFM
# Combine the results from the monetary, recency, and
# frequency calculations into a single data frame.
# 'cbind' combines the data frames by columns.
rfm <- cbind(rfm.m, rfm.r, rfm.f)

# Convert the combined object into a dataframe if it is not already.
rfm <- as.data.frame(rfm)

# Remove duplicate rows based on all columns, ensuring that each row
# in 'rfm' is unique.
rfm <- rfm[!duplicated(as.list(rfm))]

# Rename the columns to more descriptive names reflecting the RFM components.
names(rfm)[names(rfm) == "Transaction_ID"] <- "frequency"
names(rfm)[names(rfm) == "amount"] <- "monetary"
names(rfm)[names(rfm) == "days_since_latest"] <- "recency"

# Remove redundant datasets
# Clean up the environment by removing temporary data frames
# used in the calculation process.
# This helps free up memory and avoids clutter in the workspace.
rm(rfm.f, rfm.r, rfm.m)

6.1.1 Histograms

6.1.2 RFM Quintile & Scatterplot

## Correlation between RFM R score and Recency: -0.96

6.1.3 RFM Segments Table & Heatmaps

RFM Table
R_Level F_Level M_Level Customers Average Recency Average Frequency Average Monetary
High High High 296 43.1047 41.3007 9,347.5797
High High Medium 17 42.1765 16.0000 2,559.1191
High Low High 2 81.5000 2.0000 6,131.4825
High Low Low 50 49.8600 1.8600 266.7431
High Low Medium 20 42.6000 2.7500 1,019.4593
High Medium High 23 34.3478 11.7826 3,515.0455
High Medium Low 14 52.1429 5.8571 477.5779
High Medium Medium 164 46.1402 8.6829 1,695.1207
Low High High 71 305.4366 31.5211 7,482.2911
Low High Medium 7 287.7143 17.7143 2,576.4558
Low Low High 1 268.0000 4.0000 7,753.2688
Low Low Low 73 299.7260 1.6301 211.3750
Low Low Medium 19 300.3684 3.0000 1,245.1215
Low Medium High 9 292.7778 12.0000 3,781.8340
Low Medium Low 9 299.3333 4.5556 458.2323
Low Medium Medium 105 304.8762 7.9143 1,589.7832
Medium High High 167 159.8443 34.2635 7,151.9321
Medium High Medium 28 157.4286 17.4643 2,511.5858
Medium Low Low 111 164.0721 1.8468 194.0639
Medium Low Medium 18 163.0000 2.8333 1,035.4046
Medium Medium High 17 168.8235 11.1765 4,508.9259
Medium Medium Low 37 171.7297 5.6486 453.6616
Medium Medium Medium 210 168.4857 8.6524 1,526.3793

6.2 Results

Average Discount by Customer Segment
Group Avg_Discount
Champions (Best customers with highest recency and frequency) 21.82%
Loyal customers (High recency and frequency, but not as recent as Champions) 19.08%
Potential Loyalists (High recency but lower frequency than Loyal customers) 17.20%
New Customers (Highest recency but lowest frequency) 27.67%
Promising (High recency but lowest frequency) 15.89%
Need attention (Medium recency and frequency) 20.84%
About to sleep (Medium recency but low frequency) 21.25%
At risk (Low recency and medium frequency) 19.56%
Can’t lose them (Low recency but highest frequency) 16.98%
Hibernating (Low recency and low frequency) 18.33%
## The table above shows the average discount percentage for each customer segment based on their recency and frequency scores.
##     Some key observations:
##     - Champions and Loyal customers, who have high recency and frequency, tend to have higher average discounts compared to other segments.
##     - New Customers and Promising segments, despite high recency, have lower average discounts due to their low frequency.
##     - Hibernating customers, with low recency and frequency, have the lowest average discounts among all segments.

7 Empirical Challenges Faced

During the course of this RFM segmentation project, several empirical challenges were encountered. These challenges highlight the complexities and considerations involved in working with real-world customer data. Some of the key challenges faced include:

7.1 Data Quality

One of the primary challenges was ensuring the quality and integrity of the customer transaction data. The dataset required thorough cleaning, preprocessing, and validation to handle missing values, outliers, and inconsistencies. Ensuring accurate and reliable data was crucial for the validity of the RFM analysis and segmentation results.

7.2 Handling Large Datasets

The e-commerce dataset contained a substantial number of customer records and transactions. Processing and analyzing large volumes of data presented computational challenges. Efficient data manipulation techniques and optimized algorithms were necessary to handle the dataset effectively and generate timely insights.

7.3 Defining RFM Metrics

Determining the appropriate definitions and thresholds for the Recency, Frequency, and Monetary metrics was a critical challenge. The selection of suitable time frames, such as considering the most recent transaction date or the entire transaction history, required careful consideration. Setting appropriate thresholds for categorizing customers into different RFM segments also involved iterative experimentation and domain expertise.

7.4 Handling Skewed Distributions

The distributions of the RFM metrics, particularly the Frequency and Monetary values, exhibited significant skewness. Dealing with skewed data required applying appropriate transformations, such as logarithmic scaling, to achieve more normalized distributions. This ensured that the RFM scores and segments accurately reflected the underlying customer behavior patterns.

7.5 Addressing Outliers

The presence of outliers in the dataset, such as customers with exceptionally high or low RFM values, posed challenges in the segmentation process. Outliers can potentially distort the RFM scores and segment assignments. Identifying and handling outliers required careful consideration, such as setting appropriate thresholds or applying robust statistical techniques to mitigate their impact.

7.6 Selecting Optimal Number of Segments

Determining the optimal number of customer segments was another empirical challenge. While the RFM framework suggests using quintiles, the specific number of segments may vary based on the business context and data characteristics. Balancing segment granularity and interpretability required iterative experimentation and evaluation of different segmentation approaches.

7.7 Validation and Interpretation

Validating the RFM segmentation results and interpreting the characteristics of each segment posed challenges. Assessing the stability and robustness of the segments across different time periods or data subsets was important to ensure the reliability of the insights. Interpreting the segment characteristics and translating them into actionable business strategies required collaboration with domain experts and stakeholders.

Addressing these empirical challenges required a combination of data preprocessing techniques, statistical methods, and domain knowledge. Iterative refinement, sensitivity analysis, and collaborative discussions with business stakeholders were essential to overcome these challenges and derive meaningful insights from the RFM segmentation analysis.

8 Conclusions

8.1 Business Recommendation

8.1.1 Can’t Lose Them

  • Offer exclusive discounts or incentives.

  • Personalize marketing communications and product recommendations.

  • Consider subscription options or VIP benefits.

8.1.2 At Risk

  • Create compelling promotions like BOGO deals or bundle pricing.

  • Develop targeted content to restore interest.

  • Use retargeting ads and personalized messaging.

8.1.3 Hibernating

  • Focus on brand awareness and re-engagement.

  • Encourage social media interactions.

  • Consider reactivation campaigns with incentives.

8.2 Next Steps

8.2.1 Integration with Other Models

  • Predictive Analytics

    • Forecast customer behavior.

    • Proactively target customers with personalized offers.

    • Predict likelihood to purchase, churn probability, or expected lifetime value.

  • Sentiment Analysis

    • Analyze customer reviews, feedback, and social media interactions.

    • Gain insights into customer satisfaction, preferences, and pain points.

    • Tailor marketing messages and improve product offerings.

8.2.2 Enriching Data

  • Customer Demographics

    • Incorporate age, gender, location, income level, or education data.

    • Create comprehensive customer profiles.

    • Personalize marketing campaigns and product recommendations.

  • Social Media Interactions

    • Integrate likes, shares, comments, or mentions data.

    • Understand customer engagement and brand perception.

    • Identify brand advocates and influencer potential.

8.2.3 Performance Metrics

  • Key Performance Indicators (KPIs)

    • Establish clear KPIs for each RFM segment.

    • Monitor response rates, conversion rates, average order value, customer lifetime value, retention rates, or net promoter score.

    • Assess the effectiveness of marketing initiatives.

  • Dashboards

    • Develop interactive dashboards for each RFM segment.

    • Display key metrics, trends, and comparisons across segments.

    • Facilitate data-driven decision making.

8.2.4 Continuous Refinement

  • Regular Updates

    • Update RFM segmentation model with the latest customer transaction data.

    • Adapt segmentation and marketing strategies to evolving customer preferences and market conditions.

  • A/B Testing

    • Evaluate the effectiveness of different marketing approaches for each segment.

    • Test variations in messaging, offers, channels, or creative elements.

    • Identify the most impactful strategies for each segment.

  • Customer Feedback

    • Actively seek and incorporate customer feedback.

    • Conduct surveys, focus groups, or customer interviews.

    • Gather qualitative insights into customer preferences, expectations, and experiences.

    • Refine segment definitions, identify new opportunities, and address gaps.

By integrating RFM segmentation with other models, enriching data, establishing performance metrics, and continuously refining the approach, businesses can create a comprehensive and dynamic customer segmentation strategy. This holistic approach enables companies to effectively target, engage, and retain customers, ultimately driving business growth and profitability in the ever-evolving e-commerce landscape.